---
layout: page
title: Script 1 - Intro
permalink: /scripts/script1/
parent: R Scripts
nav_order: 2
---

Script 1: Introduction and First Steps in R

This is an introduction to using R, with the aim of learning how to analyse data using real-world examples. We will first learn the basic syntax of R with some very basic toy examples, then proceed to use data from a published paper to learn how to explore larger datasets.

Run all of the code in this script in your own local RStudio script. You don’t necessarily have to do this here, but I strongly recommend that you always annotate your code, including the output or result of the code you run, unless it is trivial or already discussed below. Make sure to be explicit about output and substantive results when responding to questions and tasks.



Main Operators Introduced in this Session

<- # assigns values to an object
$ # selects a variable in a dataset
& # logical "A and B"
| # logical "A or B"
== # logical "A equal to B"
!= # logical "A not equal to B"
%in% # logical "A is contained in the set B"

Main Commands Introduced in this Session

c() #combines objects into a vector
data.frame() #binds vectors into a dataframe
length() #length of a vector/variable
unique() #unique values in a vector/variable
table() # tabulate values of a vector/variable
mean() # calculate the mean of a variable
View() #opens dataframe in new window
nrow() # number of rows in a dataframe
names() # variable/column names
tapply() # apply a function to subgroups
subset() # isolate a subset of data from a matrix/data.frame
levels() # identify levels of a variable
ifelse() # return one value where a condition is TRUE and another where it is FALSE

The Basics of the R language

Basic Syntax: Operators, Functions, and Objects

At its most basic, you can use R as a calculator: type out mathematical operations in the script section after you have opened a script (the field in the top left of the interface), highlight the chunk of code you want to run, then press the Run button. You will get the result in the console section (the field in the bottom left of the RStudio interface). You can also tell R to run your code with ctrl+enter instead of pressing Run.

2+2

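Note that R syntax is not sensitive to white space within a line: 2+2, or 2 + 2, or 2 +2 are equivalent. Similarly, you can continue your code on a new line, as long as the expression is still incomplete at the line break (for example, after an operator or inside an open parenthesis).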
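A small sketch of how line breaks behave: R keeps reading onto the next line only while the expression is still incomplete. The object names below are made up for illustration.

```r
# A trailing operator tells R that the expression continues on the next line
total_a <- 2 +
  2

# An open parenthesis has the same effect
total_b <- sum(1, 2,
               3)

# Here the first line is already complete, so R stores 2 in total_c
# and then evaluates "+ 2" as a separate expression
total_c <- 2
+ 2

total_a  # 4
total_b  # 6
total_c  # 2
```

The safe habit is to break lines after an operator or a comma, never before one.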

The mathematical operators in R are pretty intuitive:

-3 * 5
(20-6) / 2
7^2
64^(0.5)

R evaluates mathematical operators in the order you’ve been taught at school: parentheses first, then exponentiation, then multiplication and division, then addition and subtraction (PEMDAS). So, if you want to compute \(4+\frac{2}{5+13}\), you would run

4+2/(5+13)

While if you want to compute \(\frac{4+2}{5+13}\), you would run

(4+2)/(5+13)

This is a good place to introduce your first functions. Functions in R are expressions that come with parentheses and within these parentheses you pass one or more arguments. The following lines of code are some simple mathematical functions, where you pass one numerical argument. The functions then return the result of an operation:

abs(-10) #computes the absolute value
sqrt(81) #computes the square root
log(4) #computes natural logarithm
exp(0.5) #computes the exponential function (e to the power of a number)

Sometimes you have to pass more than one argument to the function. We then use commas to separate the arguments:

log(8, base = 2) #computes the logarithm of a number in base 2
round(123.456789, digits = 2) #rounds a number to the 2nd decimal place
round(-123.456789, digits = 4) #rounds a number to the 4th decimal place

When you become familiar with a function and the arguments it takes, you won’t need to specify the argument’s name. In many ways, R becomes simpler to use the more familiar you are with the functions. For instance, the expressions above are equivalent to:

log(8, 2) #computes the logarithm of a number in base 2
round(123.456789, 2) #rounds a number to the 2nd decimal place
round(-123.456789, 4) #rounds a number to the 4th decimal place

If you want to know more about what a function does, you can type ? and then the function, and run the code. Or, alternatively, use the help() function. This call will open the help window in the bottom right of your interface, with descriptions and examples of how the function can be used. For instance:

?log
help(log)

The principle of using functions can look complicated at first, but a function call is just an expression that returns a value, like any other operation. You can therefore nest functions and operations within each other in a single expression:

sqrt(10*5-1)
round(log(5+5)*2, digits = 3)

It is often useful to “store” values as named objects, using the all-important <- (assign) operator. These values can be numbers or character strings; in the latter case you have to surround the text with quotation marks.

Note that = is an alternative assignment operator. However, its use for assignment is discouraged, because you are going to use the equal sign in other contexts to do other things. You will see it used in other people’s code, but to make things as unambiguous as possible, get used to using only <- for assignment.

x <- 7
y <- 4+4
my_course <- "Data Analysis in R"

When you assign values to a named object, R does not return the value in the console, but it will appear in the global environment window in the top right of the interface. This means that you have successfully stored this value, and at any point you can simply run your object’s label (e.g. my_course) and R will return the value you have assigned to it.

Note that you can access the object from any script in the same R session. Note also that the object’s label needs to be continuous text, so instead of using my course, you want to use my_course with an underscore.

You can pass named objects as arguments of functions or operations, just as we did numbers:

sqrt(x)
(x*y)/2

But obviously it’s not a particularly good idea to run

my_course+5

rather, you’d want to return the object itself:

print(my_course) # or simply
my_course
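To see why adding a number to text fails, and what to use instead, here is a minimal sketch. It wraps the failing call in tryCatch() so that the script keeps running; nchar() and paste() are base R functions that do work on character strings.

```r
my_course <- "Data Analysis in R"

# Arithmetic on a character string raises an error; tryCatch() captures
# the error message instead of stopping the script
error_message <- tryCatch(
  my_course + 5,
  error = function(e) conditionMessage(e)
)
error_message

# Functions designed for text work as expected
nchar(my_course)                  # number of characters in the string
paste(my_course, "- Session", 1)  # joins the pieces into one string
```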

The Building Blocks: Vectors and Dataframes

In data analysis, we normally work with variables taking a series of values across a number of observations rather than a single unit only. R handles these sequences as vectors, and you can create your own vectors using the c() function (combine) - in fact, you have already been using vectors all along: the objects we worked with so far are simply vectors of length 1!

Obviously, we can also create larger vectors by combining several values:

prime_numbers <- c(2,3,5,7,11)
zero_to_ten <- c(0,1,2,3,4,5,6,7,8,9,10)
friends <- c("Rachel", "Monica", "Joey", "Chandler", "Ross", "Phoebe")

Some other ways to create vectors:

one_to_onehundred <- 1:100
#creates a vector with all integers between 1 and 100

the_decimals <- seq(0, 1, by = 0.1)
#creates a vector with all numbers from 0 to 1 by intervals of 0.1

ones <- rep(1, 100)
#creates a vector repeating the first argument (1), n times,
#where n is the second argument (100).

lots_of_friends <- rep(friends, 5)
#as above but here the first argument is the vector "friends".

You can now pass the functions and operations described above to the whole sequence. The function is applied to each element of the original vector, and R will compile a vector with the sequence of the corresponding results:

squares_of_primes <- prime_numbers^2
ten_to_twenty <- zero_to_ten+10

Other functions return a single value computed from all the elements in the vector:

length(zero_to_ten) #returns the number of items in a vector
sum(zero_to_ten) #returns the sum of elements of the vector
mean(zero_to_ten) #returns the mean of elements of the vector
median(zero_to_ten) #returns the median of elements of the vector
max(zero_to_ten) #returns the maximum value in the vector
min(zero_to_ten) #returns the minimum value in the vector

You can also ask R to evaluate each element of a vector according to a logical expression. This will return a vector of logical values TRUE/FALSE. Stick a pin on this because it will be very useful when we index and subset data frame variables later. For instance:

zero_to_ten > 6 #is the element larger than six?
zero_to_ten >= 4 #is the element greater than or equal to four?
zero_to_ten == 1 #is the element a one?

Note the double equal operator used in the last example. Annoyingly, when you want to ask R if an object is equal to something, you have to use ==; the single = is only to be used within functions, to pass values to named arguments. Using = instead of == is probably the most common syntax error for new learners and experienced coders alike, so when R spits out an error, double-check the equality signs in your code.
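The difference between the two signs can be sketched as follows. The commented-out line is left inactive on purpose, because running it would overwrite x.

```r
x <- 7

is_seven <- x == 7   # comparison: asks whether x equals 7
is_eight <- x == 8   # comparison: asks whether x equals 8
is_seven             # TRUE
is_eight             # FALSE

# Inside a function call, a single = matches a value to a named argument
rounded <- round(3.14159, digits = 2)
rounded

# Outside a function call, a single = assigns, so this line would silently
# overwrite x rather than compare it:
# x = 8
```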

Finally, you can use the unique() function to get a vector of the unique elements occurring in a vector, and the table() function to see how many times a value occurs.

fruit <- c("apple", "pear", "pear", "orange",
"banana", "apple", "pear", "banana",
"pear", "apple", "apple", "banana")

unique(fruit)
table(fruit)

#note that the output of table() is not a vector, but a different class of object, a table.
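Because the result of table() is its own kind of object, it helps to know a few ways of working with it. The sketch below repeats the fruit vector so that it runs on its own; fruit_counts is a name chosen here for illustration.

```r
fruit <- c("apple", "pear", "pear", "orange",
           "banana", "apple", "pear", "banana",
           "pear", "apple", "apple", "banana")

fruit_counts <- table(fruit)

class(fruit_counts)                    # "table"
names(fruit_counts)                    # the distinct values, sorted alphabetically
fruit_counts["apple"]                  # count for one value, looked up by name
sort(fruit_counts, decreasing = TRUE)  # most frequent values first
```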

Just like vectors are collections of objects, we can create collections of vectors: dataframes. You can do so by passing your vectors to the data.frame() function. By default, your vectors will be treated as columns, and the vector names will become column names.

zero_to_ten
squares <- zero_to_ten^2
my_dataframe <- data.frame(zero_to_ten, squares)

#visualise my_dataframe in the console simply by calling:
my_dataframe

#visualise my_dataframe in a viewer window:
View(my_dataframe)

#you can pass as many arguments to data.frame() as you want:
roots <- sqrt(zero_to_ten)
my_dataframe <- data.frame(zero_to_ten, squares, roots)

#you can give your dataframe column names within the data.frame() function, for instance:
my_dataframe <- data.frame(zero_to_ten, cubes = zero_to_ten^3)

Note that when you create an object with the same name of an existing object (in this case, my_dataframe), the previously saved object will be overwritten - if in doubt, choose a new name!

Be careful with the length of your vectors when using the data.frame() function. If the vectors are not of the same length, but the longer length is a whole multiple of the shorter one, the shorter vector will be recycled (repeated, or “looped”) when you bind them. If the longer length is not a whole multiple of the shorter one, you will not be able to bind them into a single data frame:

length(fruit)
#what's the length of my vector?

data.frame(fruit, binary_variable = c(0,1))
#the second vector is of length 2, which is a divisor of 12, so it gets replicated.

data.frame(fruit, binary_variable = c(0,1,0,0,1))
#the second vector is of length 5, which is *not* a divisor of 12, so you get an error.
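The same repetition rule can be seen in a smaller, self-contained sketch. The column names id and flag are hypothetical, and tryCatch() is used only so that the failing call does not stop the script.

```r
# In vector arithmetic the shorter vector is repeated to match the longer one
c(1, 2, 3, 4, 5, 6) + c(0, 10)   # 1 12 3 14 5 16

# data.frame() repeats a shorter column when the lengths divide evenly
short_df <- data.frame(id = 1:6, flag = c(0, 1))
short_df$flag                    # 0 1 0 1 0 1

# 4 does not divide 6 evenly, so data.frame() raises an error
result <- tryCatch(
  data.frame(id = 1:6, flag = c(0, 1, 0, 0)),
  error = function(e) "error: lengths are not compatible"
)
result
```

Silent repetition is convenient but easy to trigger by accident, so it is worth checking length() before binding vectors.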

Dataframes can take text, numerical and logical vectors. As long as they’re the same length, you’re good to go:

students <- c("Student A", "Student B", 
              "Student C", "Student D")
grade <- c(55, 70, 35, 81)
pass <- grade >= 50

data.frame(students, grade, pass)

To recap: objects can be numbers (1, 2, 3), text strings (“banana”, “apple”) or logical values (TRUE, FALSE). Vectors are collections of objects that you combine with the c() function. Dataframes are collections of vectors that you create with the data.frame() function.


Loading Data in R

Most of the time, you will not create your dataframe from scratch: you will work with datasets available out there in the wild. So before proceeding to look at all the things you can do with dataframes, let’s first see how to import a dataset into R.

Loading data in RStudio (Desktop)

We will use the leaders dataset, which we will need to load directly into R. First we need to set our working directory, where we can save our R-scripts, as well as any data used in future scripts. This is an important first step, don’t skip it if you haven’t already done this!

• Find a location on your computer where you would like to save your work for this course. I strongly encourage you to create a folder for this class.

• Download the leaders.csv data file from GitHub or the button above, and save it into the script1 folder you just created.

• Open a new R-Script. Type the code below into the first line of the script, replacing the file-path with the specific path on your computer. setwd() tells the R-Session where to begin looking for any files. For instance, your file-path could look something like this:

setwd("C:/User/Desktop/Data_Analysis_R/script1")

To load the leaders dataset, we will use the read.csv() function; and the assignment operator <- to give it a name and save it as an object in our Global Environment.

leaders <- read.csv("leaders.csv")

Note that you can also directly call the file by indicating its complete path in the read.csv function.
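As a sketch of reading a file by its complete path, the example below writes a tiny made-up dataset to a temporary folder and reads it back. The file name example_leaders.csv and its contents are hypothetical and only serve to make the example runnable on any computer.

```r
# Hypothetical mini dataset written to a temporary folder for illustration
example_path <- file.path(tempdir(), "example_leaders.csv")

example_data <- data.frame(
  country = c("A", "B", "C"),
  year    = c(1900, 1950, 2000)
)
write.csv(example_data, example_path, row.names = FALSE)

# Read the file by its complete path, without changing the working directory
example_loaded <- read.csv(example_path)

file.exists(example_path)  # check that R can find the file
getwd()                    # shows the current working directory
example_loaded
```

file.path() builds the path with the correct separators for your operating system, which avoids typing slashes by hand.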

Working with Dataframes

Some Information about our Dataset

The dataset we will be using in the rest of this script has been compiled for a paper by Jones and Olken (2009) that aims to answer empirically the question of whether political assassinations can change the course of history. We can debate individual cases on their own historical merits:

• Did the assassination of Archduke Franz Ferdinand cause World War One, or was Europe set on the path of conflict regardless?

• Did the assassination of Rwandan President Habyarimana cause the Rwandan genocide?

• Was the 1908 Lisbon regicide causally related to the democratisation of Portugal in the following years?

But as aspiring political scientists, at least for today, we are also interested in asking whether assassinations have, on average, an effect on the likelihood of e.g. wars or regime changes and, if so, in which direction the effect works.

The problem with empirically testing questions like these is that assassinations do not happen at random and, as a result, there are many confounding factors that need to be accounted for to infer the actual cause of the outcomes.

We may look at a dataset on occurrences of leaders’ assassinations and wars in every country and every year, and find a positive relationship between assassinations and wars in the following years. But how can we tell if one is the consequence of the other, as opposed to both being caused by the same thing – for instance, social unrest or economic downturns?

Jones and Olken propose a solution: instead of comparing all the cases where we observe a leader’s assassination with all the cases where we do not observe a leader’s assassination, they compare successful assassinations and failed attempts. Their argument is that whether assassinations turn out to be “hits” or “misses” is down to chance: whether a bullet hit a vital organ or just brushed past the target, whether due to a last minute change of plans the target was inside or outside the range of an explosion etc.

To the extent that this assumption holds, a sample of successful and unsuccessful assassination attempts constitutes a natural experiment, wherein our treatment (a successful assassination) is assigned as-if random to some countries in some years.

If you want to read more about what they found, see:

Jones, Benjamin and Benjamin Olken. 2009. “Hit or Miss? The Effect of Assassinations on Institutions and War.” American Economic Journal: Macroeconomics 1(2): 55–87, http://dx.doi.org/10.1257/mac.1.2.55.

Describing the Dataframe

It is important to get a sense of a dataset’s structure whenever we’re working with a new or unfamiliar dataset.

• The following commands are helpful for getting acquainted with new data.

dim(leaders) # number of observations and variables
names(leaders) # variable names
head(leaders) # prints first six rows of the dataset
tail(leaders) # prints last six rows
summary(leaders) # provides a quick summary of each variable in the dataset
View(leaders) # visualises the dataset

Each observation of the leaders dataset contains information about an assassination attempt.

| Name | Description |
|---|---|
| country | The name of the country |
| year | Year of assassination attempt |
| leadername | Name of leader who was targeted |
| result | Result of the assassination attempt (one of 10 categories) |
| age | Age of the targeted leader |
| politybefore | Average Polity score during the 3 year period prior to the attempt |
| polityafter | Average Polity score during the 3 year period after the attempt |
| civilwarbefore | 1 if country was in civil war during the 3 year period prior to the attempt, 0 otherwise |
| civilwarafter | 1 if country was in civil war during the 3 year period after the attempt, 0 otherwise |
| interwarbefore | 1 if country was in interstate war during the 3 year period prior to the attempt, 0 otherwise |
| interwarafter | 1 if country was in interstate war during the 3 year period after the attempt, 0 otherwise |

Our key variables of interest are the following: The polity variables represent an index of democracy. The Polity score documents and quantifies the regime types of all countries in the world since 1800. The Polity score is a 21-point scale ranging from -10 (most autocratic) to 10 (most democratic). The result variable is a 10 category factor variable describing the result of each assassination attempt.

As we know, the number of assassination attempts in the data is equal to the number of observations. But how would we find this information in the dataset?

nrow(leaders) # the number of rows/observations

Learning about our Data

An essential operator when we work with dataframes is $ (extract). It allows us to select one variable in the dataframe to work with. So, for instance if the column we’re looking for is country in the dataframe leaders, we call:

leaders$country

And we get a vector of countries in the order in which they appear in the dataset. Now we can apply to this vector all the functions we learned earlier (and more)!

For instance, let’s see how many times each country occurs. We can use our friend the table() function for this.

table(leaders$country) # number of times each country appears in the dataset

table(leaders$leadername) # number of times each leader appears in the dataset

table(leaders$age) # number of times each leader's age appears in the dataset

When you combine the table() function with the powers of logical expressions in R, you can gain even more insights into your data:

leaders$age < 30
#returns a vector taking the value of TRUE if age is below 30, and FALSE otherwise.

table(leaders$age < 30)
#returns the number of leaders aged 30 or over (FALSE) and under 30 (TRUE).

prop.table(table(leaders$age < 30))
#you can use prop.table to express the table values as proportions (multiply by 100 for percentages).
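The same steps can be tried on a small vector where the answer is easy to check by eye. The ages below are made up for illustration and are not taken from the leaders data.

```r
# Hypothetical ages, for illustration only
ages <- c(25, 41, 63, 29, 55, 70, 38, 47)

under_30 <- ages < 30

table(under_30)                              # counts of FALSE and TRUE
prop.table(table(under_30))                  # proportions that sum to 1
round(100 * prop.table(table(under_30)), 1)  # the same as percentages

# TRUE counts as 1 and FALSE as 0, so sum() and mean() give the
# same information directly
sum(under_30)    # number of values under 30
mean(under_30)   # share of values under 30
```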

table() gave us the number of times each country appears in the dataset, but what if we want to know the number of unique countries that are in the dataset?

unique(leaders$country)
#unique() returns a vector with unique values of variable.

length(unique(leaders$country))
# length() computes the length of the vector

Let’s now analyse a numerical variable. Which years does our dataset cover?

table(leaders$year)

min(leaders$year) # first year in the dataset: minimum value of `year`

max(leaders$year) # last year in the dataset: maximum value of `year`

summary(leaders$year) # summary statistics of `year`

How frequent are assassination attempts? How would we calculate this from our data?

number_of_attempts <- nrow(leaders)

time_span <- length(min(leaders$year):max(leaders$year))

assassinations_per_year <- number_of_attempts/time_span

# or, all in one expression:
nrow(leaders)/length(min(leaders$year):max(leaders$year))

Here, we take the number of assassination attempts and divide them by the time span between earliest and latest year included in the data.
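The time span counts every calendar year from the earliest to the latest, including both ends. A minimal sketch with made-up years shows that the length of the sequence equals the difference between the two years plus one.

```r
# Hypothetical years of five attempts, for illustration only
years <- c(1878, 1901, 1901, 1950, 2001)

span_a <- length(min(years):max(years))  # count the years in the sequence
span_b <- max(years) - min(years) + 1    # same result by subtraction
span_c <- diff(range(years)) + 1         # range() returns c(min, max)

span_a                   # 124
length(years) / span_a   # attempts per year in this toy example
```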


Indexing and Subsetting Data

Oftentimes, we want to work with only a fraction of our data, such as observations in a particular time period or in a particular region. Subsetting and indexing allow us to call and isolate particular observations, or subsets, of the data. We can call specific subsets of data or the contents of individual cells.

Indexing

The leaders dataset is saved as a data.frame, and we can access variables and observations using commands in the following ways:

• We can use square brackets, in the form of data[row,column], to call the contents of one or more cells.

leaders[1,1] # access the first row and the first column
leaders[1:5,] # show the first five rows for all columns
leaders[,1:5] # show the first five columns for all rows

• As we have seen, to call particular variables in a data.frame, we use the $ notation in the following way: data$variablename

leaders$country # $ notation to call specific variables from the dataset

leaders$leadername[200] # view leadername of the 200th observation

• We can also call variables by name using quotation marks within square brackets [,"varname"].

leaders[1:4, c("country","leadername")] # view rows 1 to 4 of multiple variables

• We can pass logical expressions in square brackets, so as to index one variable conditional on the value of another variable.

leaders$leadername[leaders$country=="Afghanistan"]
# displays name of leaders where `country` is Afghanistan

Note what’s going on “behind the scenes” in the last example: the expression in the square brackets leaders$country=="Afghanistan" is a vector that takes the logical value of TRUE wherever the country of an observation is Afghanistan, just like this:

leaders$country=="Afghanistan"

So when we index leaders$leadername by this logical expression, R is returning the elements of leaders$leadername in positions 1, 2 and 3, which are the positions for which the expression in the square brackets is TRUE. We can show that this is true by comparing these two indexations of our vector:

leaders$leadername[leaders$country=="Afghanistan"]
leaders$leadername[c(1,2,3)]
# the two are in fact the same!
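If you want R to tell you which positions are TRUE rather than counting them by eye, the base function which() does this. The sketch below uses a small made-up data frame (toy) with placeholder leader names.

```r
# Hypothetical toy data, for illustration only
toy <- data.frame(
  country    = c("Afghanistan", "Afghanistan", "Albania",
                 "Algeria", "Afghanistan"),
  leadername = c("Leader 1", "Leader 2", "Leader 3",
                 "Leader 4", "Leader 5")
)

is_afg <- toy$country == "Afghanistan"  # logical vector, one value per row
is_afg
which(is_afg)                           # positions where is_afg is TRUE: 1 2 5

# Indexing by the logical vector or by the positions gives the same result
by_logical  <- toy$leadername[is_afg]
by_position <- toy$leadername[which(is_afg)]
identical(by_logical, by_position)      # TRUE
```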

How would you find the leadername of leaders below the age of 30?

leaders$leadername[leaders$age<30] #prints the name of leaders under 30

Furthermore, we can also perform mathematical operations on the indexed values.

• To calculate the mean age for all leaders that “died within a day” after assassination attempts, we simply append the mean() command to the indexing statement.

mean(leaders$age[leaders$result=="dies within a day after the attack"])

• To do the same for all leaders other than those who died within a day, we use the != operator (not equal to) instead of ==.

mean(leaders$age[leaders$result!="dies within a day after the attack"])

• To calculate the mean age of leaders across all possible values of the variable result, we can use the tapply() (table-apply) function. The basic syntax of tapply() is tapply(target_variable, subgroups_variable, the_function_you_want_to_apply).

tapply(leaders$age, leaders$result, mean)
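Here is a sketch of tapply() on a small made-up data frame, so that the group means can be verified by hand. It also shows how to pass an extra argument (na.rm = TRUE) through to mean(), which matters if a variable contains missing values; whether the leaders data has any is not assumed here.

```r
# Hypothetical toy data with one missing age, for illustration only
toy <- data.frame(
  age    = c(50, 60, 40, 70, 36, NA),
  result = c("not wounded", "plot stopped", "not wounded",
             "wounded lightly", "plot stopped", "not wounded")
)

# Mean age within each value of result; a group containing NA returns NA
mean_age <- tapply(toy$age, toy$result, mean)
mean_age

# Additional arguments after the function are passed on to it
mean_age_complete <- tapply(toy$age, toy$result, mean, na.rm = TRUE)
mean_age_complete
```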

Note that indexing does not alter the underlying data frame. It is merely a way to present partial, conditional data and allows us to work with them by using corresponding operators.

Subsetting

Instead of indexing through square brackets, we can also use the subset() function to construct a new data.frame that contains just the selected subset of observations from the original dataset. This function is more intuitive. It leaves the original dataset untouched and returns a new data.frame, which appears in the Global Environment once we assign it a name.

• Let’s start with a very basic subset. Create a dataset with only leaders from Afghanistan. We use the form newdataframe <- subset(existing_data.frame, selection_argument).

afghanistan<-subset(leaders, leaders$country=="Afghanistan")

#we can also simply write country, as we already have supplied the name of our dataframe

afghanistan<-subset(leaders, country=="Afghanistan")
afghanistan # prints the new subset

#note once again the "==" operator for logical equality.

Let’s now try to subset on multiple conditions. To do so we need the & (and) operator. Say, we want a dataframe of assassination attempts in the interwar period, between 1918 and 1939 (including both years).

interwar_attempts <- subset(leaders, year>= 1918 & year <= 1939)
interwar_attempts

Or a dataframe of assassination attempts in China and Japan, before 1930. Here we can use the & operator in conjunction with the %in% operator. That is, we have two conditions: first, the year is before 1930, and second, the observation’s country name is in the list we provided.

chinajapan_attempts <- subset(leaders, country %in% c("China", "Japan") & year < 1930)
chinajapan_attempts

# Note, you can get the same result using logical conditions only using the | (or) operator
chinajapan_attempts <- subset(leaders, (country=="Japan" | country=="China") & year < 1930)
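The parentheses around the | condition matter, because R evaluates & before |. A minimal sketch on made-up data (toy) compares the two correct versions and shows what happens when the parentheses are left out.

```r
# Hypothetical toy data, for illustration only
toy <- data.frame(
  country = c("China", "Japan", "China", "France", "Japan"),
  year    = c(1910, 1925, 1935, 1920, 1940)
)

with_in <- subset(toy, country %in% c("China", "Japan") & year < 1930)
with_or <- subset(toy, (country == "Japan" | country == "China") & year < 1930)
identical(with_in, with_or)   # TRUE: both keep rows 1 and 2

# Without parentheses this reads as: Japan, OR (China AND before 1930),
# so the Japanese observation from 1940 is kept as well
without_parens <- subset(toy, country == "Japan" | country == "China" & year < 1930)
without_parens$year           # 1910 1925 1940
```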

Creating New Variables

We can also use indexing to create new variables. Let’s create a dichotomous (only taking values of 0 or 1; also called “dummy”) variable that captures whether a leader survived or died as a result of the assassination attempt.

What are the possible values of result?

leaders$result # prints all the values
table(leaders$result) # tabulates all the values
unique(leaders$result) # returns unique values
levels(factor(leaders$result)) # returns the levels; factor() is needed if result was read in as text (the read.csv() default since R 4.0)

• Note: the difference between unique() and levels() is that the former returns all the values that actually appear in the variable; the latter returns all the values that the variable can take, i.e. including those that appeared in the original dataset but not in a given subset. It’s a subtle difference, but it’s worth remembering: when you have to manipulate a variable, you may need to change its levels so as to allow it to take new values or to remove unused levels. Also note the different order in which unique() and levels() return their result: unique() gives you the values in the order in which the function “finds” them; levels() returns them in the order the coder decides to assign to each level (the default is alphabetical order).
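The difference can be seen on a small made-up factor. The sketch below is illustrative only: outcome and its values are hypothetical, and droplevels() is the base R function that removes levels no longer in use.

```r
# Hypothetical factor, for illustration only
outcome <- factor(c("wounded", "not wounded", "wounded", "plot stopped"))

unique(outcome)   # values in order of first appearance
levels(outcome)   # all possible values, alphabetical by default

# After dropping one category, the level is still "remembered"
outcome_subset <- outcome[outcome != "plot stopped"]
as.character(unique(outcome_subset))  # "wounded" "not wounded"
levels(outcome_subset)                # still includes "plot stopped"

# droplevels() removes the levels that no longer occur
levels(droplevels(outcome_subset))    # "not wounded" "wounded"
```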

Recode the result variable to create a new indicator of deaths by assassination.

leaders$died <- NA # first, create an empty vector

# fill this empty vector with 1 if the leader died
leaders$died[leaders$result=="dies between a day and a week"] <- 1

leaders$died[leaders$result=="dies between a week and a month"] <- 1
leaders$died[leaders$result=="dies within a day after the attack"] <- 1
leaders$died[leaders$result=="dies, timing unknown"] <- 1

# fill with 0 if they survived
leaders$died[leaders$result=="hospitalization but no permanent disability"] <- 0
leaders$died[leaders$result=="not wounded"] <- 0
leaders$died[leaders$result=="plot stopped"] <- 0
leaders$died[leaders$result=="survives but wounded severely"] <- 0
leaders$died[leaders$result=="survives, whether wounded unknown"] <- 0
leaders$died[leaders$result=="wounded lightly"] <- 0

It is sometimes easier to use the ifelse() command to perform operations on different levels of a variable. The ifelse() command has a three part structure:

• First, supply a logical expression (a condition, if you will);

• Second, supply the value to return where the expression is TRUE;

• Third, supply the value to return where the expression is FALSE.

ifelse() works element by element (here, row by row), checking whether the logical condition is met, and then returning the value of the TRUE or the FALSE argument accordingly. It’s a lot shorter to write, and more intuitive too!
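A minimal sketch of the three-part structure, reusing the grade values from earlier in this script. The labels "pass", "fail", "high", "middle" and "low" are chosen here purely for illustration.

```r
grade <- c(55, 70, 35, 81)

# condition, value where TRUE, value where FALSE
outcome <- ifelse(grade >= 50, "pass", "fail")
outcome   # "pass" "pass" "fail" "pass"

# ifelse() calls can be nested to create more than two categories
band <- ifelse(grade >= 70, "high",
               ifelse(grade >= 50, "middle", "low"))
band      # "middle" "high" "low" "high"
```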

In the code below, we create a variable that is identical to leaders$died, except we write fewer lines of code. We tell R to assign 1 to all rows in the data where the corresponding value of the result variable is %in% (literally in) the set of values indicating that the leader died, which we store as a vector using c(). If the value is not in that set, the logical statement is FALSE and the observation’s value for success is assigned 0 (i.e. they were not assassinated).

leaders$success <-
  ifelse(leaders$result %in% c("dies between a day and a week",
                               "dies between a week and a month",
                               "dies within a day after the attack",
                               "dies, timing unknown"), 1, 0)

# if you want to be absolutely sure the variables are the same, check:
# all(leaders$died == leaders$success)

How “successful” are assassination attempts?

mean(leaders$died)
mean(leaders$success)
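Two points are worth spelling out here: how to compare two variables in one step, and why the mean of a 0/1 variable is a proportion. The sketch uses short made-up vectors; the version with a missing value is included only as a precaution, not because the leaders data is known to contain one.

```r
# Hypothetical 0/1 vectors, for illustration only
died    <- c(1, 0, 0, 1, 0)
success <- c(1, 0, 0, 1, 0)

all(died == success)      # TRUE if every pair of elements matches
identical(died, success)  # stricter check: same values and same type

# The mean of a 0/1 variable is the share of 1s
mean(died)                # 0.4, i.e. 2 out of 5

# A single missing value makes the mean NA unless it is removed
died_with_gap <- c(1, 0, NA, 1, 0)
mean(died_with_gap)                # NA
mean(died_with_gap, na.rm = TRUE)  # 0.5
```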

Just as we saw with subsetting, you can also recode a new variable conditional on multiple other variables. For instance, our dataset has two variables interwarbefore and civilwarbefore taking the value of 1 if the country was experiencing an interstate or a civil war prior to the assassination attempt, respectively, and 0 otherwise.

If we want to synthesise this information in a single variable that records whether a country was experiencing any type of war, we can use ifelse() in combination with the | (or) operator.

leaders$warbefore <- ifelse(leaders$interwarbefore == 1 |
leaders$civilwarbefore == 1, 1, 0)

Repeating the same exercise for interwarafter and civilwarafter:

leaders$warafter <- ifelse(leaders$interwarafter == 1 |
leaders$civilwarafter == 1, 1, 0)
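To check that the | operator behaves as intended, it helps to try the recoding on a toy data frame that covers all four combinations of the two war indicators. The data below are made up for illustration.

```r
# Hypothetical toy data covering every combination of the two indicators
toy <- data.frame(
  interwarbefore = c(0, 1, 0, 1),
  civilwarbefore = c(0, 0, 1, 1)
)

# 1 if either type of war (or both) was ongoing, 0 otherwise
toy$warbefore <- ifelse(toy$interwarbefore == 1 |
                          toy$civilwarbefore == 1, 1, 0)
toy

# An equivalent shortcut: convert the logical result directly to 0/1
warbefore_alt <- as.numeric(toy$interwarbefore == 1 | toy$civilwarbefore == 1)
identical(toy$warbefore, warbefore_alt)   # TRUE
```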